Introduction

Abstract

  A report from FRED (Federal Reserve Economic Data, maintained by the Federal Reserve Bank of St. Louis) on labour market conditions highlighted a drastic change in software engineer job postings within the past 5 years. Indexed on \(Feb 1, 2020 = 100\), the number of postings rose steeply, peaking in early 2022 (Index = 240). Yet, seemingly just as rapidly, the number of postings fell to a low in late 2023. With numerous tech unicorns announcing unprecedented layoff numbers, the tech bubble appears to have burst. This paper evaluates the software/data engineering market in 2021 and 2023, showing the differences in available roles, postings by location, and employee skillset requirements.

Figure 1: Line chart of Indeed job postings with a baseline of Feb 1, 2020. The chart is seasonally adjusted based on historical patterns from 2017-2019. Each series, including the national trend, occupational sectors, and sub-national geographies, is seasonally adjusted separately.

Hypothesis

The two main questions of interest are as follows:

  1. What are the differences between 2021 and 2023 postings?
  2. Can we use posting metadata to predict salaries?

The goal of this paper is to clarify why so many engineers have been struggling to find employment opportunities in North America. Three datasets are used to answer the above questions: two Kaggle datasets and one dataset found on GitHub.

Methods

Data Summary

  The first Kaggle dataset, titled Software Engineering Jobs Dataset, was compiled by Yazeed Fares. The dataset contains 9380 observations and 8 features and was collected by scraping LinkedIn Jobs. While the scraping was performed on Dec. 25, 2023, not all jobs were posted on that specific date, as LinkedIn job postings can stay open for up to 6 months. For the purpose of this exploration, we assume the jobs in this dataset are representative of postings from the latter half of 2023.

  The second dataset was developed by Mike Lawrence, a Machine Learning Engineer at Google. The dataset contains 8261 observations and 13 features. Similarly, this dataset was scraped from LinkedIn postings, in this case in October 2021.

  Both datasets contain metadata for postings, but not all columns align, so we subset both datasets to their matching features for the purpose of comparison. The variables of interest are listed below. After subsetting the data, all NA values are removed.

Figure 2: Summary and Description of Variables of Interest for 2021 and 2023 Datasets

| Variable | Type | Description |
|---|---|---|
| Company | character | Name of company |
| Description | character | Description of job, including but not limited to company overview, requirements, and skillset |
| Title | character | Name of position |
| Location | character | Location of job |
| Seniority | character | Classification of role based on experience, technical expertise, and leadership responsibilities |
| Year | factor | Year the job was posted |
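
The subsetting and NA-removal step described above can be sketched as follows. This is a minimal illustration in Python using plain dictionaries, not the code used for the analysis; the function name `harmonize` and the choice to represent NA as `None` are assumptions made for the example.

```python
# Columns shared by both datasets (from Figure 2); Year is added afterwards.
SHARED_COLUMNS = ["Company", "Description", "Title", "Location", "Seniority"]

def harmonize(rows, year):
    """Keep only the shared columns, drop rows with missing values, tag the year.

    Assumption: each row is a dict and a missing value (NA) is stored as None.
    """
    cleaned = []
    for row in rows:
        # Subset to the matching features; absent columns count as missing.
        subset = {col: row.get(col) for col in SHARED_COLUMNS}
        if any(value is None for value in subset.values()):
            continue  # drop rows containing NA
        subset["Year"] = year
        cleaned.append(subset)
    return cleaned
```

Applying the same function to both datasets, with `year` set to 2021 or 2023, gives two lists of records with identical keys that can be concatenated for comparison.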

Data Wrangling

Titles

  Due to the structure of job titles, additional data wrangling is required for proper categorization. For example, variation between postings can result in different titles corresponding to the same type of position (e.g. Sr. Software Engineer vs. Senior Software Engineer). This would affect group_by() operations, resulting in many more categories than necessary. As a result, custom titles are defined based on keyword matches. Using the case_when() function, titles are classified from specific to generic. Thus a title such as “Front-end Software Engineer” is classified as Front End Engineer rather than Software Engineer.

Figure 3: Regex used to create classification levels for posting titles. Rows are listed alphabetically, not in matching order.

| Title | Pattern |
|---|---|
| Back End Engineer | Back-end, backend |
| Cloud Engineer | Cloud |
| Data Engineer | Data |
| Data Scientist | Data Scientist |
| DevOps Engineer | Devops |
| Embedded Systems Engineer | Embedded, System |
| Front End Engineer | Front-end, front end, frontend |
| Full Stack Engineer | full stack, full-stack |
| Machine Learning Engineer | Machine Learning, AI, Artificial Intelligence |
| Mobile Software Engineer | Mobile, iOS, Android |
| Other | .* |
| QA Engineer | Test, Quality, QA |
| Research Engineer | Research, Scientist |
| Security Engineer | Security, Cyber |
| Site Reliability Engineer | Site Reliability, site-reliability |
| Software Engineer | Software |
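
The specific-to-generic matching can be illustrated with a short sketch. It is written in Python with the standard `re` module rather than with case_when(), covers only a subset of the Figure 3 patterns, and the rule order shown is an assumption, since the figure does not state the order in which the patterns were applied.

```python
import re

# Ordered rules: the first matching pattern wins, so specific titles must
# come before generic ones. The order here is an assumption for illustration.
TITLE_RULES = [
    ("Machine Learning Engineer", r"machine learning|\bai\b|artificial intelligence"),
    ("Data Scientist", r"data scientist"),
    ("Site Reliability Engineer", r"site[ -]reliability"),
    ("Front End Engineer", r"front[ -]?end"),
    ("Back End Engineer", r"back-?end"),
    ("Full Stack Engineer", r"full[ -]stack"),
    ("Data Engineer", r"data"),
    ("Software Engineer", r"software"),
]

def classify_title(raw_title):
    """Return the first matching category, or "Other" as the fallback."""
    for label, pattern in TITLE_RULES:
        if re.search(pattern, raw_title, flags=re.IGNORECASE):
            return label
    return "Other"  # plays the role of the catch-all .* pattern

print(classify_title("Front-end Software Engineer"))  # Front End Engineer
print(classify_title("Sr. Software Engineer"))        # Software Engineer
```

Because "Front End Engineer" is tested before "Software Engineer", a title containing both keywords lands in the more specific category, and spelling variants such as "Sr." and "Senior" no longer produce separate groups.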

Seniority

  As with titles, seniority levels are inconsistent between the 2 datasets. The 2021 dataset contains 8 levels of seniority, while the 2023 dataset contains only 2 classifications. To keep the classifications consistent across both datasets, custom Seniority levels are defined based on keywords in the title (e.g. Staff ~ Staff Level).

Figure 4: Regex used to create classification levels for posting Seniority Level

| Seniority | Pattern |
|---|---|
| Principal | Principal |
| Staff | Staff |
| Lead | Lead |
| Senior | Sr., Sr, Senior, III |
| Founding | Founding |
| Manager | Manager |
| Junior | Entry Level, Junior, Entry-Level, Graduate, Jr., II, Jr, I |
| Junior | Entry level, Associate |
| Senior | Mid-Senior level |
| None Specified | .* |
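
Short keywords such as "I", "II", and "Sr" are easy to match by accident inside longer words, so a sketch of the seniority rules is given below. It is written in Python with the standard `re` module, follows the row order of Figure 4, and uses word boundaries (`\b`); the word boundaries are an assumption about how such patterns could be made safe, not a description of the original implementation.

```python
import re

# Rules in the order of Figure 4; the first match wins.
# \b (word boundary) stops "I" from matching inside "iOS" or "Senior".
SENIORITY_RULES = [
    ("Principal", r"\bprincipal\b"),
    ("Staff", r"\bstaff\b"),
    ("Lead", r"\blead\b"),
    ("Senior", r"\b(sr|senior|iii)\b"),
    ("Founding", r"\bfounding\b"),
    ("Manager", r"\bmanager\b"),
    ("Junior", r"\b(entry[ -]level|junior|graduate|associate|jr|ii|i)\b"),
]

def classify_seniority(text):
    """Map a title or dataset seniority label to a harmonized level."""
    for label, pattern in SENIORITY_RULES:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return label
    return "None Specified"  # fallback, equivalent to the .* row
```

With these rules, "Software Engineer III" maps to Senior and "Software Engineer II" to Junior, while the dataset labels "Mid-Senior level" and "Entry level" map to Senior and Junior respectively.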

GeoData

  Additional wrangling was required to plot Postings Count ~ Location. The provided location data only contains the posting location formatted as City, State. After some experimentation, the data was best visualized when grouped by state, so translation to latitude and longitude coordinates was required. After extracting the State from each location string, the Google Geocoding API was used to obtain coordinates for each state. Entries located outside of the US or with incorrect formatting were dropped, leaving 7258 postings from 2021 and 6948 from 2023.
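
A minimal sketch of the state extraction and filtering step is shown below, in Python with the standard library only. It assumes the State part is a two-letter abbreviation and uses a deliberately truncated set of states; both are assumptions for illustration. The geocoding call itself is omitted because it requires an API key.

```python
from collections import Counter

# Truncated for illustration; a real run would list all US states.
US_STATES = {"CA", "TX", "NY", "WA"}

def extract_state(location):
    """Return the state from a "City, State" string, or None if unusable."""
    parts = [part.strip() for part in location.split(",")]
    if len(parts) != 2:
        return None  # incorrect formatting
    state = parts[1].upper()
    return state if state in US_STATES else None  # drops non-US entries

def count_by_state(locations):
    """Count postings per state, skipping entries that could not be parsed."""
    states = (extract_state(loc) for loc in locations)
    return Counter(state for state in states if state is not None)
```

The resulting per-state counts are what would then be joined to one coordinate pair per state for plotting.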

Preliminary Analysis

  The following figures present the EDA on our variables of interest. All variables are either factors or text, so visualizations are limited to bar charts listing the top-n counts grouped by each feature.
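
The top-n counts behind each bar chart amount to a frequency table per feature. A minimal Python sketch, assuming each posting is a dictionary keyed by the variable names in Figure 2 (the function name `top_n_counts` is hypothetical):

```python
from collections import Counter

def top_n_counts(records, feature, n=10):
    """Return the n most frequent values of `feature` as (value, count) pairs."""
    return Counter(record[feature] for record in records).most_common(n)
```

Calling it once per year and per feature (Title, Seniority, Company) yields the values plotted in the comparisons that follow.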

  Figure 5 compares the number of postings in 2021 and 2023. Note that this list is not exhaustive of all postings in 2021/2023 and should not be taken as evidence contradicting the hypothesis. It reflects the number of postings available on LinkedIn at the time each dataset was collected. It is quite possible that the scraper missed postings, or that volumes were lower or higher at the point in time when the scraper pulled the dataset.

Figure 5: Comparison of number of postings in 2021 and 2023.

  Software Engineer was the most popular role in both years, but the key difference is a lack of specificity in roles in 2023. The 2021 dataset has over 1000 occurrences of Machine Learning Engineer, Site Reliability Engineer, and Data Scientist postings, while the second most common title in the 2023 dataset is Embedded Systems Engineer, with 700 postings. If we set aside the fact that the 2023 dataset was not scraped for data-science-related roles, the difference between the 2 years isn’t as drastic as we hypothesized.

Figure 6: Comparison of top 10 posting titles in 2021 vs 2023.

  One noticeable difference between the two years is the desired seniority level. In 2021, entry-level/junior engineer roles were the most frequent seniority, with 2687 postings. However, the 2023 dataset shows a large shift towards senior roles, with the vast majority of postings being for Senior Engineer (4183), along with increased postings for Staff, Principal, and Lead Engineer roles.

Figure 7: Comparison of seniority counts for postings in 2021 vs 2023

  Although this may be attributable to the timing of data collection, the companies posting in 2021 are mostly traditional “big-tech” firms, while postings from 2023 are more scattered. Apple held over 600 postings, followed by Microsoft, Uber, and Salesforce, each with ~100 postings. According to Layoffs.fyi, 1,036 tech companies laid off a total of 238,397 employees in the first nine months of 2023. This aligns with the data, which show a lack of postings from popular SaaS companies and tech giants. In fact, the largest number of postings in 2023 comes from Jobs for Humanity, a platform for “Connecting historically under represented talent to welcoming employers across the globe”.

Figure 8: Comparison of company counts in 2021 vs 2023.

  Figure 9 shows the geographic distribution of job opportunities. Each bubble indicates the number of postings in the given state, relative to the other bubbles on the map. There isn’t a significant difference between the two years: both see the largest number of postings in California, Texas, and New York, the “tech hubs” of the U.S.

Figure 9: Maps of The United States showing relative postings counts by state.

Summary

Findings

  From an initial breakdown of the datasets and variables, we found that 2021 and 2023 saw slight differences in the metadata associated with postings. Location counts were the only consistent factor between the two years, while the Title, Seniority, and Company data all support the argument that job seekers faced increased difficulty in 2023. Many popular employers did not post opportunities for new graduates or entry-level candidates and instead sought to fill more senior or leadership positions.

Future Steps

  The next step of the project will tackle NLP and prediction models. Job skillsets, years of experience, and sentiment can be extracted and compared for the two years. An additional dataset containing salaries of SWE jobs in 2023 will be introduced to compare wages. A numeric feature allows for further exploration of variable relationships such as Title~Salary, Location~Salary, and Company~Salary. MLR, GLMM, and Boosting models will then be trained on the new data to answer question 2 of our hypothesis: salary prediction.